This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Please check http://www.gnu.org/licenses/.
The present script takes the CARNIVAL results inferred from transcriptomics data from intestinal organoids treated with Sars-CoV-2 for 60 hours and makes some enrichment analysis and clustering.
The original transcriptomic dataset is coming from the following publication: Lamers et al. 2020.
The differential expression analysis was conducted by Martina Poletti (Martina.Poletti@earlham.ac.uk)
We first load the required libraries and some important functions to analyse CARNiVAL output.
library(readr)
library(fgsea)
library(dplyr)
library(ggplot2)
library(piano)
library(tidyr)
library(ggsci)
library(igraph)
library(gprofiler2)
library(plotly)
## Function to extract the nodes that appear in CARNIVAL network and the
## background genes (all genes present in the prior knowledge network).
## It returns a list with two objects: the success and the background genes.
extractCARNIVALnodes <- function(CarnivalResults){
CarnivalNetwork <-
as.data.frame(CarnivalResults$weightedSIF, stringsAsFactors = FALSE)
colnames(CarnivalNetwork) <- c("source", "sign", "target", "Weight")
## We define the set of nodes interesting for our condition
sucesses <- unique(c(gsub("_.*","",CarnivalNetwork$source),
gsub("_.*","",CarnivalNetwork$target)))
CarnivalAttributes <- as.data.frame(CarnivalResults$nodesAttributes,
stringsAsFactors = FALSE)
## We define the background as all the genes in our prior knowledge network.
bg <- unique(gsub("_.*","",CarnivalAttributes$Node))
return(list(sucesses = sucesses, bg= bg))
}
### Function to print a barplot with the enriched pathways.
BarplotEnrichment <- function(PathwaysSelect, Interesting_pathways){
p <- ggplot(PathwaysSelect, aes(x = reorder(pathway, pvalue),
y = -log10(pvalue))) +
geom_bar(aes(fill = mean_stat), stat = "identity") +
scale_fill_gradient2(low = "darkblue", high = "indianred",
mid = "whitesmoke", midpoint = 0) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1,
colour = ifelse(levels(reorder(PathwaysSelect$pathway,
PathwaysSelect$pvalue)) %in% Interesting_pathways,
"red", "grey40"),
face = ifelse(levels(reorder(PathwaysSelect$pathway,
PathwaysSelect$pvalue)) %in% Interesting_pathways,
"bold", "plain")),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
xlab("")
return(p)
}
We then read the CARNIVAL results generated in the previous script. We define two different gene sets in order tor conduct the enrichment. The first set contains the nodes that appear in the CARNIVAL output and are therefore relevant in the context of our input transcriptomic data. The second set contains all the genes in our prior knowledge network which are used as the backgroud.
carnival_results_inhibition <-
readRDS("Carnival_Results/carnival_results_top50tf_pleitropic_minsize15.rds")
nodes_carnival_inhibition <- extractCARNIVALnodes(carnival_results_inhibition)
carnival_results_activation <-
readRDS("Carnival_Results/carnival_results_top50tf_pleitropic_minsize15_activation.rds")
nodes_carnival_activation <- extractCARNIVALnodes(carnival_results_activation)
carnival_results_undefined <-
readRDS("Carnival_Results/carnival_results_top50tf_pleitropic_minsize15_undefinedEffect.rds")
nodes_carnival_undefined <- extractCARNIVALnodes(carnival_results_undefined)
We downloaded from MSigDB https://www.gsea-msigdb.org/ the following dataset: c2.cp.v7.0.symbols.gmt. It contains several pathways from different resources and the genes that are known to be involved in those pathways.
Pathway_signatures <- loadGSC("InputData/c2.cp.v7.0.symbols.gmt")
We read the results from the differential expression analysis. The statistic of the genes will be mapped later on in the different significant pathways. Maybe, this is not very informative in this context.
DEA_results <- read_csv("InputData/Clevers_degs_all.csv") %>%
dplyr::select(Gene, stat)
## Parsed with column specification:
## cols(
## Gene = col_character(),
## baseMean = col_double(),
## log2FoldChange = col_double(),
## lfcSE = col_double(),
## stat = col_double(),
## pvalue = col_double(),
## padj = col_double()
## )
mean_stat <- unlist(lapply(Pathway_signatures$gsc, function(x, DEA_results) {
genes_matching <- x[which(x %in% DEA_results$Gene)]
mean_genes <- dplyr::filter(DEA_results, Gene %in% genes_matching) %>%
dplyr::pull(stat) %>% mean(na.rm = TRUE)
return(mean_genes)
}, DEA_results = DEA_results)) %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
dplyr::rename(mean_stat = ".")
### Trying with Carnival node activity
carnival_avg_node_act <- carnival_results_inhibition$nodesAttributes %>%
as.data.frame() %>%
dplyr::mutate(AvgAct = as.numeric(AvgAct))
mean_stat_carni <-
unlist(lapply(Pathway_signatures$gsc, function(x, carnival_avg_node_act) {
genes_matching <- x[which(x %in% carnival_avg_node_act$Node)]
mean_genes <- dplyr::filter(carnival_avg_node_act, Node %in% genes_matching) %>%
dplyr::pull(AvgAct) %>% mean(na.rm = TRUE)
return(mean_genes)
}, carnival_avg_node_act = carnival_avg_node_act)) %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
dplyr::rename(mean_stat = ".")
Using the Piano R package, we run a gene set analysis (GSA) based on a list of significant genes (CARNIVAL nodes) and a gene set collection (background). It uses Fisher’s exact test.
First for CARNIVAL results when we assume that the viral-host proteins interactions are inhibitory:
hyper_results_inhibition <-
runGSAhyper(genes = nodes_carnival_inhibition$sucesses,
# pvalues = rep(0,length(nodes_carnival_inhibition$sucesses)),
pcutoff = 0.01,
universe = nodes_carnival_inhibition$bg, gsc = Pathway_signatures,
gsSizeLim = c(4,Inf), adjMethod = "fdr")
## Warning in runGSAhyper(genes = nodes_carnival_inhibition$sucesses, pcutoff = 0.01, : there are genes in gsc that are not in the universe, these will
## be removed before analysis
## Analyzing the overrepresentation of 85 genes of interest in 2054 gene sets, using a background of 6811 non-interesting genes.
enriched_pathways_inhibition <- hyper_results_inhibition$resTab %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
as_tibble() %>%
dplyr::filter(`Adjusted p-value` <= 0.01) %>%
dplyr::inner_join(mean_stat) %>%
dplyr::select(pathway, `p-value`, `Adjusted p-value`, mean_stat) %>%
dplyr::rename(pvalue = `p-value`, AdjPvalu = `Adjusted p-value`) %>%
dplyr::mutate(pathway = as.factor(pathway))
## Joining, by = "pathway"
interesting_pathways_inhibition <- c()
p_inhibition <- BarplotEnrichment(enriched_pathways_inhibition,
interesting_pathways_inhibition)
Then, for CARNIVAL when we assume that the viral-host proteins interactions are stimulations:
hyper_results_activation <-
runGSAhyper(genes = nodes_carnival_activation$sucesses,
# pvalues = rep(0,length(nodes_carnival_activation$sucesses)),
pcutoff = 0.01,
universe = nodes_carnival_activation$bg, gsc = Pathway_signatures,
gsSizeLim = c(4,Inf), adjMethod = "fdr")
## Warning in runGSAhyper(genes = nodes_carnival_activation$sucesses, pcutoff = 0.01, : there are genes in gsc that are not in the universe, these will
## be removed before analysis
## Analyzing the overrepresentation of 86 genes of interest in 2054 gene sets, using a background of 6810 non-interesting genes.
enriched_pathways_activation <- hyper_results_activation$resTab %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
as_tibble() %>%
dplyr::filter(`Adjusted p-value` <= 0.01) %>%
dplyr::inner_join(mean_stat) %>%
dplyr::select(pathway, `p-value`, `Adjusted p-value`, mean_stat) %>%
dplyr::rename(pvalue = `p-value`, AdjPvalu = `Adjusted p-value`) %>%
dplyr::mutate(pathway = as.factor(pathway))
## Joining, by = "pathway"
interesting_pathways_activation <- c()
p_activation <- BarplotEnrichment(enriched_pathways_activation,
interesting_pathways_activation)
Then, for CARNIVAL when we assume that the viral-host proteins interactions are undefined and we let CARNIVAL infer the effect: (Note: For visualization purposes, we reduce here the significant p-value to 0.0001)
hyper_results_undefined <-
runGSAhyper(genes = nodes_carnival_undefined$sucesses,
# pvalues = rep(0,length(nodes_carnival_activation$sucesses)),
pcutoff = 0.01,
universe = nodes_carnival_undefined$bg, gsc = Pathway_signatures,
gsSizeLim = c(4,Inf), adjMethod = "fdr")
## Warning in runGSAhyper(genes = nodes_carnival_undefined$sucesses, pcutoff = 0.01, : there are genes in gsc that are not in the universe, these will
## be removed before analysis
## Analyzing the overrepresentation of 141 genes of interest in 2054 gene sets, using a background of 6741 non-interesting genes.
enriched_pathways_undefined <- hyper_results_undefined$resTab %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
as_tibble() %>%
dplyr::filter(`Adjusted p-value` <= 0.0001) %>%
dplyr::inner_join(mean_stat) %>%
dplyr::select(pathway, `p-value`, `Adjusted p-value`, mean_stat) %>%
dplyr::rename(pvalue = `p-value`, AdjPvalu = `Adjusted p-value`) %>%
dplyr::mutate(pathway = as.factor(pathway))
## Joining, by = "pathway"
interesting_pathways_undefined <- c()
p_undefined <- BarplotEnrichment(enriched_pathways_undefined,
interesting_pathways_undefined)
Now I compare the enriched pathways that change the most between both assumptions: host-viral interactions are inhibitory or stimulatory.
results_activation <- hyper_results_activation$resTab %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
as_tibble() %>%
dplyr::mutate(LogpValue_activation = -log(`p-value`),
p_value_activation = `p-value`) %>%
dplyr::select(pathway, p_value_activation, LogpValue_activation)
results_inhibition <- hyper_results_inhibition$resTab %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "pathway") %>%
as_tibble() %>%
dplyr::mutate(LogpValue_inhibition = -log(`p-value`),
p_value_inhibition = `p-value`) %>%
dplyr::select(pathway, p_value_inhibition, LogpValue_inhibition)
results_join <-
dplyr::inner_join(results_activation, results_inhibition) %>%
dplyr::filter(p_value_activation < 0.05 | p_value_inhibition < 0.05) %>%
dplyr::mutate(diff = LogpValue_activation - LogpValue_inhibition) %>%
dplyr::top_n(15, wt = abs(diff)) %>%
dplyr::arrange(desc(abs(diff)))
## Joining, by = "pathway"
results_join_longer <- results_join %>%
dplyr::select(pathway, diff, LogpValue_activation, LogpValue_inhibition) %>%
dplyr::rename(activation = "LogpValue_activation",
inhibition = "LogpValue_inhibition") %>%
pivot_longer(-c(pathway,diff), values_to = "LogPvalue",
names_to = "Interaction")
point_plot_activation_vs_inhibtion <-
ggplot(results_join_longer, aes(reorder(pathway, abs(diff)), LogPvalue)) +
geom_point(aes(color = Interaction), size = 3) +
coord_flip() +
theme_minimal() +
theme(legend.position = "bottom", legend.justification = "center") +
scale_color_lancet() +
theme(axis.text.x = element_text(hjust = 1, size =8.5),
axis.text.y = element_text(size = 7),
panel.grid.minor = element_blank()) +
xlab("Pathways") + ylab("-Log (p-value)") +
geom_hline(yintercept = -log(0.05), linetype="dashed",
color = "#2F4F4F", size=0.5)
point_plot_activation_vs_inhibtion
For the clustering part, we are going to take our CARNIVAL results when the effect of the viral proteins on the host proteins is undefined. We first convert our CARNIVAL output network into an igraph object. Then, we perform different types of clustering and select the one with the largest modularity for further analysis.
carnival_results_igraph <- carnival_results_undefined$weightedSIF %>%
as.data.frame() %>% dplyr::select(-Sign) %>%
dplyr::rename(weight = "Weight") %>%
igraph::graph_from_data_frame(directed = FALSE)
c_edgeBetweeness <-
cluster_edge_betweenness(carnival_results_igraph)
c_FastGreedy <-
cluster_fast_greedy(carnival_results_igraph,
merges = TRUE,
modularity = TRUE,
membership = TRUE,
weights = E(carnival_results_igraph)$weight)
c_Infomap <-
cluster_infomap(carnival_results_igraph,
e.weights = E(carnival_results_igraph)$weight,
v.weights = NULL, nb.trials = 10, modularity = TRUE)
c_labelProp <-
cluster_label_prop(carnival_results_igraph,
weights = E(carnival_results_igraph)$weight)
c_leadingEig <-
cluster_leading_eigen(carnival_results_igraph,
weights = E(carnival_results_igraph)$weight)
c_Louvain<-
cluster_louvain(carnival_results_igraph,
weights = E(carnival_results_igraph)$weight)
c_walktrap <-
cluster_walktrap(carnival_results_igraph,
weights = E(carnival_results_igraph)$weight,
steps = 4, merges = TRUE, modularity = TRUE, membership = TRUE)
modularity_df <- data.frame(method = c("EdgeBetweeness", "FastGreedy",
"Infomap","LabelPropagation","LeadingEigenvector", "Louvain","Walktrap"),
modularity = c(max(c_edgeBetweeness$modularity),
max(c_FastGreedy$modularity),
max(c_Infomap$modularity),
max(c_labelProp$modularity),
max(c_leadingEig$modularity),
max(c_Louvain$modularity),
max(c_walktrap$modularity))) %>%
dplyr::arrange(desc(modularity))
modularity_df
## method modularity
## 1 FastGreedy 0.6792753
## 2 Louvain 0.6736467
## 3 Infomap 0.6436640
## 4 LeadingEigenvector 0.6427823
## 5 EdgeBetweeness 0.6185302
## 6 Walktrap 0.6021636
## 7 LabelPropagation 0.5763919
We then selected the communities generated by the Fast Greedy method. We will perform enrichment analysis on them.
## Number of genes in the different communities.
table(c_FastGreedy$membership)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 16 16 26 15 14 15 13 5 6 5 4 3 3
## We perform an Enrichment Analysis for each Community:
n <- length(unique(c_FastGreedy$membership))
gene_communities <-
data.frame(Gene = c_FastGreedy$names, Community = c_FastGreedy$membership)
p <- list()
for (i in seq_len(n)){
current_genes <- dplyr::filter(gene_communities, Community==i) %>%
dplyr::pull(Gene)
current_result <- gost(current_genes, user_threshold = 0.01,
correction_method = "fdr", custom_bg = nodes_carnival_undefined$bg,
sources = c("GO","KEGG","REAC","WP"))
if (!is.null(current_result)){
#currentfile <-
# paste0("02_analysis_carnival_results_files/figure-gfm/enrichment_cluster_", i, ".png")
p[[i]] <- gostplot(current_result, capped = FALSE, interactive = TRUE)
}
}
We finally show the enrichment results for the different clusters. One can find in the following link the CARNIVAL output network with the different clusters grouped along with the most significantly enriched processes:
Some of the most interesting results are the following:
Cluster 2 is enriched in the signaling of the TGF beta pathway. Potential treatments blocking this pathway have been proposed againts Covid-19.
Cluster 3 is enriched in the hedgehog signaling pathway.The majority of the proteins in this cluster are unactive.
Cluster 5 is enriched in cell cycle arrest. Some studies have suggested the arrest of the cell cycle after SARS-CoV-2 infection.
Cluster 6 is enriched in the androgen receptor signaling pathway. There are some studies suggesting that high levels of androgens might increase the risk of severe infection and death from COVID-19.
Cluster 10 is enriched in interteron signaling.
Enrichment Cluster 1:
dplyr::filter(gene_communities, Community == 1) %>% dplyr::pull(Gene)
## [1] "STAT3" "PAX7" "CARM1" "ELK1" "TOPBP1" "EGR1" "CCND2" "PRPF4B" "ATR" "CCNA2" "JUNB" "RBPJ" "KMT2A" "GTF2A2" "CRTC2" "CDKN2D"
p[[1]]
Enrichment Cluster 2:
dplyr::filter(gene_communities, Community == 2) %>% dplyr::pull(Gene)
## [1] "CSNK2B" "E2F3" "CCNB1" "TFDP1" "SMAD9" "CREBBP_EP300" "SNIP1" "ZRANB2" "MYOM2"
## [10] "STUB1" "E2F2" "BAG2" "CDK9" "BRD4" "SMAD1" "TBP"
p[[2]]
Enrichment Cluster 3:
dplyr::filter(gene_communities, Community == 3) %>% dplyr::pull(Gene)
## [1] "PRKACA" "GTF2F2" "SMO" "PRKAR2B" "HMGN1" "CREM" "RPS6KB1" "NUP210" "NUP98" "NUP214" "PRKAA2" "NUP88" "PRKACB" "RAE1"
## [15] "PRKAR2A" "GLI3" "PSMD11" "SKP1" "SHH" "PPP2R1A" "ARRB2" "PSMD1" "CREB3" "ZEB2" "GLI2" "ARID3A"
p[[3]]
Enrichment Cluster 4:
dplyr::filter(gene_communities, Community == 4) %>% dplyr::pull(Gene)
## [1] "MAPK11" "SRPK1" "TLE1" "RALA" "RUNX1" "ATF6" "MAPK9" "GNB1" "MAX" "CSNK2A2" "HBP1" "MTA1" "PDX1" "CEBPA"
## [15] "SREBF2"
p[[4]]
Enrichment Cluster 5:
dplyr::filter(gene_communities, Community == 5) %>% dplyr::pull(Gene)
## [1] "BRCA1" "CDK4" "RBX1" "E2F4" "BCL3" "CDKN1C" "CRTC3" "CALM1" "NFATC2" "CDKN2A" "SOX2" "MXI1" "FOXM1" "ZBED1"
p[[5]]
Enrichment Cluster 6:
dplyr::filter(gene_communities, Community == 6) %>% dplyr::pull(Gene)
## [1] "SIRT1" "CDK1" "HDAC1" "HIST1H4F" "NCOA3" "NCOA2" "KAT2B" "SIRT2" "MBTPS1" "SCAP" "NR0B2" "SREBF1" "UBTF"
## [14] "NR5A2" "IRF2"
p[[6]]
Enrichment Cluster 7:
dplyr::filter(gene_communities, Community == 7) %>% dplyr::pull(Gene)
## [1] "ITGB1" "NCOA1" "SMAD4" "CASP7" "NCOR1" "CASP3" "GNAS" "AKT2" "PPP1CB" "AKT3" "TBK1" "MEF2A" "MEF2B"
p[[7]]
Enrichment Cluster 8:
dplyr::filter(gene_communities, Community == 8) %>% dplyr::pull(Gene)
## [1] "PRKAB1" "ATF3" "FASN" "ELAVL1" "KLF6"
p[[8]]
Enrichment Cluster 10:
dplyr::filter(gene_communities, Community == 10) %>% dplyr::pull(Gene)
## [1] "LBR" "IFNA13" "B2M" "STAT2" "EOMES"
p[[10]]
Enrichment Cluster 12:
dplyr::filter(gene_communities, Community == 12) %>% dplyr::pull(Gene)
## [1] "PARP1" "HIF1A" "MYBL2"
p[[12]]
Enrichment Cluster 13:
dplyr::filter(gene_communities, Community == 13) %>% dplyr::pull(Gene)
## [1] "YAP1" "YWHAQ" "TEAD1"
p[[13]]
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
## [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.9.2.1 gprofiler2_0.1.9 igraph_1.2.5 ggsci_2.9 tidyr_1.1.0 piano_2.4.0 ggplot2_3.3.2 dplyr_1.0.0
## [9] fgsea_1.14.0 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.48.0 httr_1.4.2 jsonlite_1.7.0 viridisLite_0.3.0 gtools_3.8.2 assertthat_0.2.1
## [7] shiny_1.5.0 yaml_2.2.1 slam_0.1-47 pillar_1.4.6 lattice_0.20-41 glue_1.4.1
## [13] limma_3.44.3 digest_0.6.25 promises_1.1.1 colorspace_1.4-1 htmltools_0.5.0 httpuv_1.5.4
## [19] Matrix_1.2-18 pkgconfig_2.0.3 purrr_0.3.4 xtable_1.8-4 relations_0.6-9 scales_1.1.1
## [25] gdata_2.18.0 later_1.1.0.1 BiocParallel_1.22.0 tibble_3.0.3 farver_2.0.3 generics_0.0.2
## [31] ellipsis_0.3.1 DT_0.14 withr_2.2.0 shinyjs_1.1 BiocGenerics_0.34.0 lazyeval_0.2.2
## [37] cli_2.0.2 magrittr_1.5 crayon_1.3.4 mime_0.9 evaluate_0.14 fansi_0.4.1
## [43] gplots_3.0.4 shinydashboard_0.7.1 tools_4.0.2 data.table_1.12.8 hms_0.5.3 lifecycle_0.2.0
## [49] stringr_1.4.0 dorothea_1.0.2 bcellViper_1.24.0 munsell_0.5.0 cluster_2.1.0 compiler_4.0.2
## [55] caTools_1.18.0 tinytex_0.24 rlang_0.4.7 RCurl_1.98-1.2 grid_4.0.2 rstudioapi_0.11
## [61] marray_1.66.0 htmlwidgets_1.5.1 visNetwork_2.0.9 crosstalk_1.1.0.1 labeling_0.3 bitops_1.0-6
## [67] rmarkdown_2.3 gtable_0.3.0 sets_1.0-18 R6_2.4.1 gridExtra_2.3 knitr_1.29
## [73] fastmap_1.0.1 fastmatch_1.1-0 KernSmooth_2.23-17 stringi_1.4.6 parallel_4.0.2 Rcpp_1.0.5
## [79] vctrs_0.3.2 tidyselect_1.1.0 xfun_0.15